feat(string_util): make ToLower Unicode-aware via utf8proc (2/2) by goel-skd · Pull Request #760 · apache/iceberg-cpp

goel-skd · 2026-06-19T01:30:41Z

Replaces the ASCII-only StringUtils::ToLower with a Unicode-aware
implementation backed by utf8proc,
so case-insensitive name handling matches Iceberg Java's
toLowerCase(Locale.ROOT).

ToLower now lower-cases UTF-8 input using utf8proc simple (1:1) case
mapping (e.g. CAFÉ → café, GROẞE → große). Invalid UTF-8 is
returned unchanged rather than erroring.
EqualsIgnoreCase now compares the lowercased forms of both inputs, so it
is case-insensitive for non-ASCII letters too.
ToUpper is intentionally left ASCII-only — it is not used for name
matching.
utf8proc is wired into both the CMake (vendored via FetchContent / system
package) and Meson (subprojects/utf8proc.wrap) builds.
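
The behaviour described above can be sketched without the library. This is a minimal illustration, not the code in this PR: `ToLowerSketch`, `EqualsIgnoreCaseSketch`, `DecodeOne`, `EncodeOne` and `SimpleLower` are hypothetical names, and `SimpleLower` is a tiny stand-in table (ASCII, Latin-1 capitals, and U+1E9E only) for the full 1:1 mapping that utf8proc provides through `utf8proc_tolower`. It shows the shape of the approach: decode each code point, map it 1:1, re-encode, and return the input untouched if any byte sequence is not valid UTF-8.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical stand-in for utf8proc_tolower(): ASCII, Latin-1, and U+1E9E only.
char32_t SimpleLower(char32_t cp) {
  if (cp >= U'A' && cp <= U'Z') return cp + 0x20;
  // Latin-1 capitals U+00C0..U+00DE, skipping U+00D7 (multiplication sign).
  if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7) return cp + 0x20;
  if (cp == 0x1E9E) return 0xDF;  // capital sharp s -> small sharp s
  return cp;
}

// Decodes one code point starting at pos; returns false on invalid UTF-8.
bool DecodeOne(std::string_view s, std::size_t& pos, char32_t& out) {
  auto byte = [&](std::size_t i) { return static_cast<unsigned char>(s[i]); };
  const unsigned char b0 = byte(pos);
  std::size_t len = 0;
  char32_t cp = 0;
  char32_t min = 0;  // smallest code point allowed for this length (rejects overlongs)
  if (b0 < 0x80) { len = 1; cp = b0; min = 0; }
  else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
  else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
  else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
  else return false;
  if (pos + len > s.size()) return false;  // truncated sequence
  for (std::size_t i = 1; i < len; ++i) {
    const unsigned char b = byte(pos + i);
    if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
    cp = (cp << 6) | (b & 0x3F);
  }
  if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
  pos += len;
  out = cp;
  return true;
}

// Appends the UTF-8 encoding of a valid code point to out.
void EncodeOne(char32_t cp, std::string& out) {
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Lower-cases valid UTF-8 code point by code point; invalid input is returned unchanged.
std::string ToLowerSketch(std::string_view input) {
  std::string out;
  out.reserve(input.size());
  std::size_t pos = 0;
  while (pos < input.size()) {
    char32_t cp = 0;
    if (!DecodeOne(input, pos, cp)) return std::string(input);
    EncodeOne(SimpleLower(cp), out);
  }
  return out;
}

// Case-insensitive equality defined as equality of the lowercased forms.
bool EqualsIgnoreCaseSketch(std::string_view a, std::string_view b) {
  return ToLowerSketch(a) == ToLowerSketch(b);
}
```

Under these assumptions, `ToLowerSketch` maps the UTF-8 bytes of CAFÉ to those of café, and an input such as `"ABC\xFF"` comes back byte-for-byte unchanged because `0xFF` can never start a UTF-8 sequence. A utf8proc-backed version would presumably replace `DecodeOne` and `SimpleLower` with library calls such as `utf8proc_iterate` and `utf8proc_tolower`.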

Testing
Added/updated string_util_test.cc: ToLowerUnicode, ToUpperAsciiOnly,
and Unicode cases in EqualsIgnoreCase (including invalid-UTF-8
pass-through).

Closes #613.

Follow-up to #748

Replace the ASCII-only ToLower with utf8proc simple case mapping so case-insensitive name handling matches Iceberg Java's toLowerCase(Locale.ROOT). ToUpper stays ASCII-only since it is not used for name matching. EqualsIgnoreCase now compares lowercased forms. Wire utf8proc into both the CMake (vendored/system) and Meson builds. See apache#613.

wgtmac · 2026-06-19T03:27:28Z

+      target_include_directories(utf8proc::utf8proc INTERFACE ${utf8proc_SOURCE_DIR})
+    endif()
+
+    set(UTF8PROC_VENDORED TRUE)


utf8proc is licensed under the permissive MIT License (along with some Unicode data under a similarly permissive license). We need to update the LICENSE file by adding a new section at the bottom, set off by a separator, similar to this:

---
This product bundles utf8proc, which is available under the MIT License:

Copyright © 2014-2021 by Steven G. Johnson, Jiahao Chen, Tony Kelman, Jonas Fonseca, and other contributors listed in the git history.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software... (include the rest of the utf8proc MIT license text here)

Thanks @wgtmac. Good catch, let me update it.

goel-skd force-pushed the feat-613-unicode-lowercase branch from bec8884 to 69cc006 Compare June 19, 2026 01:40

goel-skd force-pushed the feat-613-unicode-lowercase branch from 69cc006 to f42e2da Compare June 19, 2026 02:13

wgtmac reviewed Jun 19, 2026

Add license info to LICENSE

b8639d6

goel-skd wants to merge 2 commits into apache:main from goel-skd:feat-613-unicode-lowercase
